Network stack specialization for performance
Contemporary network stacks are masterpieces of generality, supporting a range of edge-node and middle-node functions. This generality comes at significant performance cost: current APIs, memory models, and implementations drastically limit the effectiveness of increasingly powerful hardware. Generality has historically been required to allow individual systems to perform many functions. However, as providers have scaled up services to support hundreds of millions of users, they have transitioned toward many thousands (or even millions) of dedicated servers performing narrow ranges of functions. We argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary. This paper presents Sandstorm, a clean-slate userspace network stack that exploits knowledge of web server semantics, improving throughput over current off-the-shelf designs while retaining use of conventional operating-system and programming frameworks. Based on Netmap, our novel approach merges application and network-stack memory models, aggressively amortizes stack-internal TCP costs based on application-layer knowledge, tightly couples with the NIC event model, and exploits low-latency hardware access. We compare our approach to the FreeBSD and Linux network stacks with nginx as the web server, demonstrating a ∼3.5× throughput improvement with low CPU utilization, linear scaling on multicore systems, and saturation of current NIC hardware.
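One concrete instance of "amortizing stack-internal TCP costs based on application-layer knowledge" is pre-packetizing static content: an object is split into MSS-sized segments and their checksum contributions are computed once, then reused for every connection that requests the same object. A minimal Python sketch of the idea (the class and the 16-bit ones'-complement helper are illustrative assumptions, not Sandstorm's actual API):

```python
import array

MSS = 1448  # typical TCP maximum segment size on Ethernet

def ones_complement_sum(data: bytes) -> int:
    """16-bit ones'-complement sum, the arithmetic used by TCP checksums."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(array.array("H", data))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return s & 0xFFFF

class PrePacketizedFile:
    """Split static content into MSS-sized segments once; reuse the
    segments (and their checksum sums) for every connection instead of
    re-segmenting and re-checksumming per request."""
    def __init__(self, content: bytes):
        self.segments = [
            (content[i:i + MSS], ones_complement_sum(content[i:i + MSS]))
            for i in range(0, len(content), MSS)
        ]

    def segment_count(self) -> int:
        return len(self.segments)

f = PrePacketizedFile(b"x" * 4000)
print(f.segment_count())  # 4000 bytes -> 3 segments at MSS=1448
```

The per-connection work then reduces to filling in connection-specific header fields, which is the amortization the abstract alludes to.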
Balancing Disruption and Deployability in the CHERI Instruction-Set Architecture (ISA)
For over two-and-a-half decades, dating to the first widespread commercial deployment of the Internet, commodity processor architectures have failed to provide robust and secure foundations for communication and commerce. This is in large part due to the omission of architectural features allowing efficient implementation of the Principle of Least Privilege, which dictates that software runs only with the rights it requires to operate [19, 20]. Without this support, the impact of inevitable vulnerabilities is multiplied as successful attackers gain easy access to unnecessary rights – and often, all rights – in software systems.
CHERI Macaroons: Efficient, host-based access control for cyber-physical systems
Cyber-Physical Systems (CPS) often rely on network boundary defence as a primary means of access control; therefore, the compromise of one device threatens the security of all devices within the boundary. Resource and real-time constraints, tight hardware/software coupling, and decades-long service lifetimes complicate efforts for more robust, host-based access control mechanisms. Distributed capability systems provide opportunities for restoring access control to resource-owning devices; however, such a protection model requires a capability-based architecture for CPS devices as well as task compartmentalisation to be effective.
This paper demonstrates hardware enforcement of network bearer tokens using an efficient translation between CHERI (Capability Hardware Enhanced RISC Instructions) architectural capabilities and Macaroon network tokens. While this method appears to generalise to any network-based access control problem, we specifically consider CPS, as our method is well-suited for controlling resources in the physical domain. We demonstrate the method in a distributed robotics application and in a hierarchical industrial control application, and discuss our plans to evaluate and extend the method. Gates Cambridge Scholarship.
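The Macaroon side of this translation rests on a simple construction: a token's signature is an HMAC chain over its identifier and caveats, so any holder can attenuate the token by appending a caveat, but only a principal with the root key can mint or verify one. A minimal Python sketch of that chaining (field names and predicates are illustrative; third-party caveats and the CHERI capability translation itself are omitted):

```python
import hmac, hashlib

def chain(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def mint(root_key: bytes, identifier: bytes, caveats):
    """sig0 = HMAC(root_key, id); sig_i = HMAC(sig_{i-1}, caveat_i)."""
    sig = chain(root_key, identifier)
    for c in caveats:
        sig = chain(sig, c)
    return {"id": identifier, "caveats": list(caveats), "sig": sig}

def verify(root_key: bytes, macaroon, predicate_holds) -> bool:
    """Recompute the HMAC chain and check every caveat predicate."""
    sig = chain(root_key, macaroon["id"])
    for c in macaroon["caveats"]:
        if not predicate_holds(c):
            return False
        sig = chain(sig, c)
    return hmac.compare_digest(sig, macaroon["sig"])

m = mint(b"secret", b"robot-arm-7", [b"op = read", b"zone = cell-3"])
print(verify(b"secret", m, lambda c: True))   # True
print(verify(b"wrong",  m, lambda c: True))   # False: wrong root key
```

Because each caveat only narrows authority, the construction maps naturally onto CHERI's monotonic capability derivation, which is what makes an efficient translation plausible.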
Structural analysis of whole-system provenance graphs
System-based provenance generates traces captured from various systems; a natural way to represent and reason about these traces is as a graph. These graphs are not well understood, and current work focuses on their extraction and processing without a thorough characterization being in place. This paper studies the topology of such graphs. We analyze multiple Whole-system-Provenance graphs and show that they exhibit a hubs-and-authorities structure as well as a power-law degree distribution. Our observations allow for a novel understanding of the structure of Whole-system-Provenance graphs. DARPA.
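The structural claims above (hubs-and-authorities, power-law degree distribution) can be probed with a few lines of analysis: compute in-degrees and inspect the degree histogram, whose tail falls roughly linearly on a log-log scale under a power law. A toy sketch on a synthetic preferential-attachment graph (the graph here is invented for illustration, not a real provenance trace):

```python
import random
from collections import Counter

random.seed(0)

# Synthetic directed graph grown by preferential attachment: new nodes
# tend to point at already-popular targets, yielding heavy-tailed
# in-degrees (a crude stand-in for hub formation in provenance graphs).
edges = [(1, 0)]
targets = [0, 1]          # multiset of endpoints; sampling it is preferential
for node in range(2, 2000):
    dst = random.choice(targets)
    edges.append((node, dst))
    targets.extend([dst, node])

in_degree = Counter(dst for _, dst in edges)
degree_freq = sorted(Counter(in_degree.values()).items())

# Under a power law, log(#nodes) falls roughly linearly in log(degree);
# printing the histogram is enough to eyeball the heavy tail.
print("edges:", len(edges), "max in-degree:", max(in_degree.values()))
print("degree -> #nodes:", degree_freq[:5], "...")
```

A real characterization would fit the exponent (e.g. by maximum likelihood) rather than eyeballing, but the pipeline, graph to degree counts to tail shape, is the same.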
Into the depths of C: Elaborating the de facto standards
C remains central to our computing infrastructure. It is notionally defined by ISO standards, but in reality the properties of C assumed by systems code and those implemented by compilers have diverged, both from the ISO standards and from each other, and none of these are clearly understood. We make two contributions to help improve this error-prone situation. First, we describe an in-depth analysis of the design space for the semantics of pointers and memory in C as it is used in practice. We articulate many specific questions, build a suite of semantic test cases, gather experimental data from multiple implementations, and survey what C experts believe about the de facto standards. We identify questions where there is a consensus (either following ISO or differing) and where there are conflicts. We apply all this to an experimental C implemented above capability hardware. Second, we describe a formal model, Cerberus, for large parts of C. Cerberus is parameterised on its memory model; it is linkable either with a candidate de facto memory object model, under construction, or with an operational C11 concurrency model; it is defined by elaboration to a much simpler Core language for accessibility, and it is executable as a test oracle on small examples. This should provide a solid basis for discussion of what mainstream C is now: what programmers and analysis tools can assume and what compilers aim to implement. Ultimately we hope it will be a step towards clear, consistent, and accepted semantics for the various use-cases of C. We acknowledge funding from EPSRC grants EP/H005633 (Leadership Fellowship, Sewell) and EP/K008528 (REMS Programme Grant), and a Gates Cambridge Scholarship (Nienhuis). This work is also part of the CTSRD projects sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), under contract FA8750-10-C-0237. This is the author accepted manuscript.
The final version is available from the Association for Computing Machinery via http://dx.doi.org/10.1145/2908080.290808
Queues don't matter when you can JUMP them!
QJUMP is a simple and immediately deployable approach to controlling network interference in datacenter networks. Network interference occurs when congestion from throughput-intensive applications causes queueing that delays traffic from latency-sensitive applications. To mitigate network interference, QJUMP applies Internet QoS-inspired techniques to datacenter applications. Each application is assigned to a latency sensitivity level (or class). Packets from higher levels are rate-limited in the end host, but once allowed into the network can "jump-the-queue" over packets from lower levels. In settings with known node counts and link speeds, QJUMP can support service levels ranging from strictly bounded latency (but with low rate) through to line-rate throughput (but with high latency variance).

We have implemented QJUMP as a Linux Traffic Control module. We show that QJUMP achieves bounded latency and reduces in-network interference by up to 300×, outperforming Ethernet Flow Control (802.3x), ECN (WRED) and DCTCP. We also show that QJUMP improves average flow completion times, performing close to or better than DCTCP and pFabric. This work was supported by a Google Fellowship, EPSRC INTERNET Project EP/H040536/1, Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL), under contract FA8750-11-C-0249. This is the final published version. It first appeared at https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/grosvenor
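The end-host mechanism described above, per-level rate limiting before packets enter the network, is essentially a token bucket per latency level: high-priority levels receive small rate allocations, but their packets, once admitted, are prioritized over lower levels in the network. A simplified Python sketch (the levels, rates, and burst sizes are invented for illustration; QJUMP's real implementation is a Linux Traffic Control module in the kernel):

```python
class TokenBucket:
    """Admit a packet only if enough tokens have accrued at `rate`
    bytes/sec, up to a maximum `burst` of stored tokens."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def admit(self, size, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False

# Level 7: strict rate limit, but its packets jump queues in the fabric.
# Level 0: effectively line rate, with no latency guarantee.
levels = {7: TokenBucket(rate=10_000, burst=1_500),
          0: TokenBucket(rate=1_250_000_000, burst=64_000)}

print(levels[7].admit(1500, now=0.0))  # True: burst covers one packet
print(levels[7].admit(1500, now=0.0))  # False: bucket drained, no time passed
print(levels[0].admit(1500, now=0.0))  # True: large allocation
```

The trade-off in the abstract falls out directly: the tighter the bucket at a level, the less that level can contribute to in-network queueing, and the tighter the latency bound it can be given.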
Disk|Crypt|Net: rethinking the stack for high-performance video streaming
Conventional operating systems used for video streaming employ an in-memory disk buffer cache to mask the high latency and low throughput of disks. However, data from Netflix servers show that this cache has a low hit rate, so does little to improve throughput. Latency is not the problem it once was either, due to PCIe-attached flash storage. With memory bandwidth increasingly becoming a bottleneck for video servers, especially when end-to-end encryption is considered, we revisit the interaction between storage and networking for video streaming servers in pursuit of higher performance.

We show how to build high-performance userspace network services that saturate existing hardware while serving data directly from disks, with no need for a traditional disk buffer cache. Employing netmap, and developing a new diskmap service, which provides safe high-performance userspace direct I/O access to NVMe devices, we amortize system overheads by utilizing efficient batching of outstanding I/O requests, process-to-completion, and zero-copy operation. We demonstrate how a buffer-cache-free design is not only practical, but required in order to achieve efficient use of memory bandwidth on contemporary microarchitectures. Minimizing latency between DMA and CPU access by integrating storage and TCP control loops allows many operations to access only the last-level cache rather than bottlenecking on memory bandwidth. We illustrate the power of this design by building Atlas, a video streaming web server that outperforms state-of-the-art configurations and achieves ~72 Gbps of plaintext or encrypted network traffic using a fraction of the available CPU cores on commodity hardware.
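The "efficient batching of outstanding I/O requests" combined with process-to-completion can be illustrated with a toy queue: requests accumulate and are submitted in one call per batch, so the fixed per-submission cost (a syscall or NVMe doorbell write) is paid once per batch rather than once per request. A schematic Python model (the batch size and the notion of a "submission" are placeholders, not diskmap's NVMe interface):

```python
class BatchedQueue:
    """Accumulate I/O requests and submit each full batch in one call,
    processing it to completion, so the fixed per-submission overhead
    is amortized over the whole batch."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []
        self.submissions = 0   # one per "syscall"/doorbell, not per request
        self.completed = 0

    def enqueue(self, req):
        self.pending.append(req)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.submissions += 1
            self.completed += len(self.pending)  # process-to-completion
            self.pending.clear()

q = BatchedQueue(batch_size=32)
for req in range(1000):
    q.enqueue(req)
q.flush()
print(q.submissions, q.completed)  # 32 submissions cover 1000 requests
```

With a batch size of 32, per-request overhead drops to 1/32 of the per-call cost, which is the amortization that lets a userspace stack keep NVMe devices and NICs saturated.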
Firmament: Fast, Centralized Cluster Scheduling at Scale
Centralized datacenter schedulers can make high-quality placement decisions when scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of high latency at scale, which degrades response time for interactive tasks and reduces cluster utilization. This paper describes Firmament, a centralized scheduler that scales to over ten thousand machines at sub-second placement latency even though it continuously reschedules all tasks via a min-cost max-flow (MCMF) optimization. Firmament achieves low latency by using multiple MCMF algorithms, by solving the problem incrementally, and via problem-specific optimizations. Experiments with a Google workload trace from a 12,500-machine cluster show that Firmament improves placement latency by 20× over Quincy [22], a prior centralized scheduler using the same MCMF optimization. Moreover, even though Firmament is centralized, it matches the placement latency of distributed schedulers for workloads of short tasks. Finally, Firmament exceeds the placement quality of four widely-used centralized and distributed schedulers on a real-world cluster, and hence improves batch task response time by 6×. This work was supported by a Google European Doctoral Fellowship, by NSF award CNS-1413920, and by the Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL), under contract FA8750-11-C-0249.
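The MCMF formulation at the core of this line of work maps scheduling onto a flow network: each unit of flow routed from a task through a machine to the sink is a placement, and arc costs encode preferences such as data locality. A toy sketch with a textbook successive-shortest-path solver (for intuition only; node names and costs are invented, and Firmament's actual solvers and incremental techniques are far more sophisticated):

```python
from collections import defaultdict

class MinCostMaxFlow:
    """Textbook successive-shortest-path min-cost max-flow with
    Bellman-Ford relaxation. Fine for toy sizes; Firmament's
    contribution is making MCMF scheduling fast at datacenter scale."""
    def __init__(self):
        self.adj = defaultdict(list)  # node -> indices into self.edges
        self.edges = []               # [to, residual_cap, cost]; edge i^1 is i's reverse

    def add_edge(self, u, v, cap, cost):
        self.adj[u].append(len(self.edges)); self.edges.append([v, cap, cost])
        self.adj[v].append(len(self.edges)); self.edges.append([u, 0, -cost])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            dist, parent = {s: 0}, {}
            while True:               # Bellman-Ford over the residual graph
                updated = False
                for u in list(dist):
                    for ei in self.adj[u]:
                        v, cap, c = self.edges[ei]
                        if cap > 0 and dist[u] + c < dist.get(v, float("inf")):
                            dist[v], parent[v] = dist[u] + c, ei
                            updated = True
                if not updated:
                    break
            if t not in dist:
                return flow, cost     # no augmenting path left
            push, v = float("inf"), t # find bottleneck capacity on the path
            while v != s:
                push = min(push, self.edges[parent[v]][1])
                v = self.edges[parent[v] ^ 1][0]
            v = t                     # apply the augmentation
            while v != s:
                self.edges[parent[v]][1] -= push
                self.edges[parent[v] ^ 1][1] += push
                v = self.edges[parent[v] ^ 1][0]
            flow += push
            cost += push * dist[t]

# Toy instance: tasks T0, T1; machines M0, M1 (capacity 1 each);
# arc costs model placement preference (e.g. data locality).
g = MinCostMaxFlow()
g.add_edge("S", "T0", 1, 0); g.add_edge("S", "T1", 1, 0)
g.add_edge("T0", "M0", 1, 1); g.add_edge("T0", "M1", 1, 5)
g.add_edge("T1", "M0", 1, 2); g.add_edge("T1", "M1", 1, 1)
g.add_edge("M0", "K", 1, 0); g.add_edge("M1", "K", 1, 0)
flow, cost = g.solve("S", "K")
print(flow, cost)  # 2 tasks placed at total cost 2 (T0->M0, T1->M1)
```

The scheduler reads placements off the saturated task-to-machine arcs; rescheduling all tasks is just re-solving this (very large) flow problem, which is why solver latency dominates.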
CHERI: a research platform deconflating hardware virtualisation and protection
Contemporary CPU architectures conflate virtualization and protection, imposing virtualization-related performance, programmability, and debuggability penalties on software requiring fine-grained protection. First observed in micro-kernel research, these problems are increasingly apparent in recent attempts to mitigate software vulnerabilities through application compartmentalisation. Capability Hardware Enhanced RISC Instructions (CHERI) extend RISC ISAs to support greater software compartmentalisation. CHERI's hybrid capability model provides fine-grained compartmentalisation within address spaces while maintaining software backward compatibility, which will allow the incremental deployment of fine-grained compartmentalisation in both our most trusted and least trustworthy C-language software stacks. We have implemented a 64-bit MIPS research soft core, BERI, as well as a capability coprocessor, and begun adapting commodity software packages (FreeBSD and Chromium) to execute on the platform.
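The capability model can be illustrated, very loosely, in software: a capability is an unforgeable token carrying base, length, and permissions; every dereference is checked against them, and derived capabilities can only shrink. A toy Python model (CHERI enforces all of this in hardware on every instruction; the class and method names here are illustrative):

```python
class CapabilityViolation(Exception):
    pass

class Capability:
    """Toy model: base/length bounds plus load/store permissions,
    checked on every access, with monotonic (shrink-only) derivation."""
    def __init__(self, memory, base, length, perms=frozenset({"load", "store"})):
        self._mem, self.base, self.length, self.perms = memory, base, length, perms

    def _check(self, offset, perm):
        if perm not in self.perms or not (0 <= offset < self.length):
            raise CapabilityViolation(f"{perm} at offset {offset}")

    def load(self, offset):
        self._check(offset, "load")
        return self._mem[self.base + offset]

    def store(self, offset, value):
        self._check(offset, "store")
        self._mem[self.base + offset] = value

    def restrict(self, base_off, length, perms=None):
        """Derive a narrower capability; bounds and perms only shrink."""
        if base_off < 0 or base_off + length > self.length:
            raise CapabilityViolation("cannot grow bounds")
        return Capability(self._mem, self.base + base_off, length,
                          self.perms if perms is None else self.perms & perms)

mem = bytearray(64)
cap = Capability(mem, base=16, length=16)
cap.store(0, 0xAB)
sub = cap.restrict(4, 4, frozenset({"load"}))
print(cap.load(0))            # 171
try:
    sub.store(0, 1)           # derived capability lacks store permission
except CapabilityViolation as e:
    print("trap:", e)
```

The "hybrid" aspect is that legacy code keeps using plain pointers within a capability-delimited region, so compartmentalisation can be adopted incrementally rather than all at once.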
CHERIvoke: Characterising pointer revocation using CHERI capabilities for temporal memory safety
A lack of temporal safety in low-level languages has led to an epidemic of use-after-free exploits. These have surpassed in number and severity even the infamous buffer-overflow exploits violating spatial safety. Capability addressing can directly enforce spatial safety for the C language by enforcing bounds on pointers and by rendering pointers unforgeable. Nevertheless, an efficient solution for strong temporal memory safety remains elusive.
CHERI is an architectural extension to provide hardware capability addressing that is seeing significant commercial and open-source interest. We show that CHERI capabilities can be used as a foundation to enable low-cost heap temporal safety by facilitating out-of-date pointer revocation, as capabilities enable precise and efficient identification and invalidation of pointers, even when using unsafe languages such as C. We develop CHERIvoke, a technique for deterministic and fast sweeping revocation to enforce temporal safety on CHERI systems. CHERIvoke quarantines freed data before periodically using a small shadow map to revoke all dangling pointers in a single sweep of memory, and provides a tunable trade-off between performance and heap growth. We evaluate the performance of such a system using high-performance x86 processors, and further analytically examine its primary overheads. When configured with a heap-size overhead of 25%, we find that CHERIvoke achieves an average execution-time overhead of under 5%, far below the overheads associated with traditional garbage collection, revocation, or page-table systems. EP/K026399/1, EP/P020011/1, EP/K008528/
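The quarantine-and-sweep loop described above can be sketched in miniature: freed regions go to quarantine instead of being reused; when quarantine exceeds a threshold (the tunable heap-growth trade-off), one sweep over memory invalidates every pointer into quarantined regions, after which that memory is safe to reuse. A toy Python model (CHERIvoke does this with a shadow map over hardware capabilities; this simulation uses plain address ranges and a pointer table):

```python
class SweepingAllocator:
    """Freed blocks are quarantined; once quarantine exceeds its limit,
    a sweep revokes (nulls out) all pointers into quarantined ranges
    before that memory may be reused."""
    def __init__(self, quarantine_limit):
        self.next_addr = 0
        self.quarantine = []           # (base, size) pairs awaiting revocation
        self.quarantined_bytes = 0
        self.quarantine_limit = quarantine_limit
        self.pointers = {}             # name -> address: the "memory" we sweep

    def alloc(self, name, size):
        base = self.next_addr
        self.next_addr += size
        self.pointers[name] = base
        return base

    def free(self, base, size):
        self.quarantine.append((base, size))
        self.quarantined_bytes += size
        if self.quarantined_bytes >= self.quarantine_limit:
            self.sweep()

    def sweep(self):
        """One pass over all pointers: revoke any targeting quarantine."""
        for name, addr in self.pointers.items():
            if any(b <= addr < b + s for b, s in self.quarantine):
                self.pointers[name] = None  # dangling pointer dies here
        self.quarantine.clear()
        self.quarantined_bytes = 0

a = SweepingAllocator(quarantine_limit=64)
p = a.alloc("p", 32)
q = a.alloc("q", 32)
a.free(p, 32)                   # below the limit: "p" still dangles
print(a.pointers["p"] is None)  # False — not yet revoked
a.free(q, 32)                   # hits the limit, triggers a sweep
print(a.pointers["p"] is None)  # True — dangling pointer revoked
```

The quarantine limit is exactly the heap-size/performance knob in the abstract: a larger quarantine means rarer sweeps (less time overhead) at the cost of more dead memory held back from reuse.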